This document contains all code that I’ve used to classify the peptide sequences given in the denovo_peptide_bacteria.tsv file. Unless you are interested in the methods used, feel free to skip ahead to the results.
These tabs are really just so I can keep a record of what I did (and what didn’t work when I was building the kraken2 database).
These lines of code take the protein sequences from the supplied .tsv file and convert them to a single .fasta file, naming each peptide with a sequential number and adding the original sample name as the description. They can also write an individual .fasta file per sequence (commented out below - I found that having almost 1,000,000 individual sequence files was a bit overwhelming).
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC #removed in Biopython 1.78; with newer versions use Seq(pep_list[n]) with no alphabet

peptides = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/wetransfer-6c2cc0/denovo_peptide_bacteria.tsv', header=0, index_col=0, sep='\t')
names = list(peptides.index.values)
pep_list = list(peptides.loc[:, 'peptide'].values)
sequences = []
for n in range(len(names)):
    record = SeqRecord(Seq(pep_list[n], IUPAC.protein), id='peptide'+str(n+1), description=names[n])
    sequences.append(record)
    #SeqIO.write(record, "/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides/peptide"+str(n+1)+".fasta", "fasta")
SeqIO.write(sequences, "/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/peptide_sequences.fasta", "fasta")
Note that these were run on a server and not locally; the lines are here just to record the code used. I've had problems downloading the standard kraken2 database, which I think is linked to having NAs in the NCBI taxonomy, and I get the error:
kraken2-build --download-library bacteria --db bac_protein --use-ftp #didn't work
Step 1/2: Performing ftp file transfer of requested files
rsync_from_ncbi.pl: FTP connection error: Network is unreachable
even after making the edits to the rsync_from_ncbi.pl script suggested in both of the linked issues.
Running the standard library build on kronos didn’t work either:
wget https://github.com/DerrickWood/kraken2/archive/v2.0.8-beta.tar.gz
tar -xf v2.0.8-beta.tar.gz
cd kraken2-2.0.8-beta/
./install_kraken2.sh kraken2_dir
cp kraken2/kraken2-2.0.8-beta/kraken2_dir/kraken2 anaconda3/bin/
cp kraken2/kraken2-2.0.8-beta/kraken2_dir/kraken2-build anaconda3/bin/
cp kraken2/kraken2-2.0.8-beta/kraken2_dir/kraken2-inspect anaconda3/bin/
kraken2-build --standard --db path_to_folder --protein --threads 24 --use-ftp
Output (before and after editing the rsync_from_ncbi.pl script, as above):
Downloading taxonomy tree data... done.
Untarring taxonomy tree data... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 377 projects (934707 sequences, 269.17 Maa)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
rsync_from_ncbi.pl: unexpected FTP path (new server?) for na
So I have gone with the non-redundant protein database for now:
kraken2-build --download-library nr --db bac_protein --use-ftp --protein
kraken2-build --db bac_protein/ --download-taxonomy --use-ftp
kraken2-build --build --db bac_protein/ --threads 10
Output:
Creating sequence ID to taxonomy ID map (step 1)...
Found 7913/573044508 targets, searched through 753265413 accession IDs, search complete.
lookup_accession_numbers: 573036595/573044508 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [1h31m43.606s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 1460 bytes
Capacity estimation complete. [3m11.010s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Processed 1311 sequences (377139 bp)...
I let this run for ~5 days and this is as far as it got. But with the low numbers of classified sequences anyway, maybe this isn’t the way to go. Will try putting together a larger database.
Download files (using the scripts at this repository, edited by me to download ’_protein.faa.gz’ rather than ’_genomic.fna.gz’ files):
kraken2-build --download-taxonomy --db kraken2_protein/ --threads 12 --use-ftp
bacteria
archaea
viral
human
fungi
protozoa
#For each of the domains
perl download_bacteria_faa.pl
for file in bacteria/*.faa
do
kraken2-build --add-to-library $file --db kraken2_protein
done
#And then
kraken2-build --build --db kraken2_protein --threads 12 --protein
These each crashed at some point before downloading all of the files and, as I don't know enough Perl to fix them reliably, I wrote a script with the same steps as the Perl scripts but in Python. At the moment I have added only complete genomes, but these could be extended to all genomes at some point if useful (this is a much larger number for bacteria, though, and I'm not sure how much it would change species assignments). This script is saved at this repository.
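The replacement script itself lives in the linked repository; as a rough illustration of the download logic only (a sketch, not the actual script - the function names and column handling here are my assumptions), it reads the NCBI assembly_summary.txt for a domain, keeps complete genomes and fetches each assembly's _protein.faa.gz:

```python
import os
import urllib.request

import pandas as pd

def protein_faa_url(ftp_path):
    #the last path component of ftp_path is the assembly accession + name,
    #which also prefixes every file within that directory
    acc = ftp_path.rstrip('/').rsplit('/', 1)[-1]
    return ftp_path.rstrip('/') + '/' + acc + '_protein.faa.gz'

def download_domain(summary_path, out_dir):
    #assembly_summary.txt starts with a comment line; the second line holds the column names
    summary = pd.read_csv(summary_path, sep='\t', header=1)
    summary.columns = [c.lstrip('# ') for c in summary.columns]
    complete = summary[summary['assembly_level'] == 'Complete Genome']
    os.makedirs(out_dir, exist_ok=True)
    for ftp_path in complete['ftp_path']:
        url = protein_faa_url(ftp_path)
        urllib.request.urlretrieve(url, os.path.join(out_dir, url.rsplit('/', 1)[-1]))
```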
And then:
#For each domain
for file in bacteria/*.faa
do
kraken2-build --add-to-library $file --db kraken2_protein --threads 12
done
mv viral added_kraken2_protein
kraken2-build --build --db kraken2_protein --threads 12 --protein
So the database includes: - 362 archaea (total available including not complete = 1082) - 17,836 bacteria (total available including not complete = 190,989) - 11 fungi (total available including not complete = 328) - Human genome - 4 protozoa (total available including not complete = 94) - 9,756 viruses (total available including not complete = 10,153)
This is the code used to run Kraken2 using the fasta file created above and the protein database constructed above (with and without the confidence parameter).
kraken2 --use-names --threads 12 --db databases/kraken/kraken2_protein/ --memory-mapping peptides/peptide_sequences.fasta --output peptides/kraken2_out/peptides.kraken --report peptides/kraken2_out/peptides.kreport --confidence 0.1
No hits were found:
Loading database information... done.
1286515 sequences (11.99 Mbp) processed in 0.359s (214973.9 Kseq/m, 2002.90 Mbp/m).
0 sequences classified (0.00%)
1286515 sequences unclassified (100.00%)
I then combined the protein sequences downloaded from NCBI above into one .fasta file and made a BLAST database from them:
cat fungi/*.faa > combined_fungi.faa
cat protozoa/*.faa > combined_protozoa.faa
cat archaea/*.faa > combined_archaea.faa
cat viral/*.faa > combined_viral.faa
cat bacteria/*.faa > combined_bacteria.faa
mkdir combined_faa
mv combined_fungi.faa combined_faa
mv combined_protozoa.faa combined_faa
mv combined_archaea.faa combined_faa
mv combined_bacteria.faa combined_faa
mv combined_viral.faa combined_faa
cp vertebrate_mammalian/GCF_000001405.39_GRCh38.p13_protein.tax.faa combined_faa/
cat combined_faa/*.faa > combined_all.faa
makeblastdb -in combined_all.faa -dbtype prot
I am then running blastp against this database, using the blastp-short task (optimized for query sequences shorter than about 30 residues) and with the database copied to a ramdisk - this took ~10 days to run:
sudo mount -t ramfs none /scratch/ramdisk/
sudo cp -a protein_db_ncbi/ /scratch/ramdisk/
blastp -task blastp-short -query peptide_sequences.fasta -db /scratch/ramdisk/protein_db_ncbi/combined_all.faa -out all_peptides.out -num_threads 12 -outfmt '10 qseqid sseqid pident evalue bitscore length mismatch positive gaps'
I seem to end up with some kind of a match for ~1 in 10 peptides. Note that these all have high E-values (even when the identity is 100%), but I assume this is just because the peptides are too short for the E-value to be meaningful.
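Since -outfmt 10 writes plain comma-separated values with no header row, the column names have to be supplied when loading the output. A minimal sketch (the helper name is mine, assuming the -outfmt string given to blastp above):

```python
import pandas as pd

#column order matches the -outfmt string passed to blastp above
BLAST_COLS = ['qseqid', 'sseqid', 'pident', 'evalue', 'bitscore',
              'length', 'mismatch', 'positive', 'gaps']

def load_blast_hits(path_or_buffer):
    return pd.read_csv(path_or_buffer, header=None, names=BLAST_COLS)
```

e.g. `hits = load_blast_hits('all_peptides.out')` followed by `hits[hits['pident'] == 100]` to keep only perfect-identity matches, as is done below.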
Combine assembly summaries to get NCBI taxon IDs:
domains = ['archaea', 'bacteria', 'fungi', 'protozoa', 'vertebrate_mammalian', 'viral']
path = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/download_domain_test/'
domain_list = []
for d in range(len(domains)):
    this_dom = pd.read_csv(path+domains[d]+'/assembly_summary.txt', sep='\t', header=1, index_col=5)
    cols = list(this_dom.columns)
    cols.remove('organism_name')
    this_dom.drop(cols, axis=1, inplace=True)
    this_dom['domain'] = [domains[d] for x in range(this_dom.shape[0])]
    print(this_dom)
    domain_list.append(this_dom)
all_domains = pd.concat(domain_list)
all_domains.index = all_domains.index.map(str)
all_domains.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_domains.csv')
Make dictionaries of the peptide and sample names:
import pickle

used_fasta = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/peptide_sequences_used.fasta'
used_dict = {}
for record in SeqIO.parse(used_fasta, "fasta"):
    used_dict[record.id] = [record.description, record.seq]
with open('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/rename_peptides_original.dict', 'wb') as f:
    pickle.dump(used_dict, f)
sample_dict = {}
table = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/wetransfer-6c2cc0/denovo_peptide_bacteria.tsv'
table = pd.read_csv(table, sep='\t', header=0, index_col=0)
rows = table.index.values
for r in rows:
    if r in sample_dict: continue
    sample_dict[r] = table.loc[r, 'Sample Name']
with open('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/sample_dict.dict', 'wb') as f:
    pickle.dump(sample_dict, f)
Remove peptides for which the matches are below 100% identity:
peptides = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out.csv', header=0, index_col=0)
peptides = peptides.loc[peptides["pident "] == 100] #note the trailing space in "pident " - the column names in this file include trailing spaces
peptides.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_reduced.csv')
Add taxa assignments to peptides table:
all_domains = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_domains.csv', index_col=0, header=0)
peptides = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_reduced.csv', header=0)
peptides['organism_name'] = ['NA' for x in range(peptides.shape[0])]
peptides['genus'] = ['NA' for x in range(peptides.shape[0])]
peptides['domain'] = ['NA' for x in range(peptides.shape[0])]
peptides['peptide_name'] = ['NA' for x in range(peptides.shape[0])]
peptides['sample_name'] = ['NA' for x in range(peptides.shape[0])]
peptides['sequence'] = ['NA' for x in range(peptides.shape[0])]
with open('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/rename_peptides_original.dict', 'rb') as f:
    used_dict = pickle.load(f)
table = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/wetransfer-6c2cc0 2/denovo_peptide_bacteria.tsv'
sample_names = pd.read_csv(table, sep='\t', header=0, index_col=0)
sample_names_peptide = pd.read_csv(table, sep='\t', header=0, index_col=1)
sample_names_list = ['sample_name']
ma = 0
rows = list(peptides.index.values)
for a in range(len(rows)):
    seq = peptides.loc[rows[a], 'sseqid ']
    tid = int(seq.split('taxid|')[1])
    names = all_domains.loc[tid, :].values
    if not isinstance(names[0], str):
        names = names[0]
    peptides.loc[rows[a], 'organism_name'] = names[0]
    peptides.loc[rows[a], 'domain'] = names[1]
    peptides.loc[rows[a], 'genus'] = names[0].split(' ')[0]
    pep = used_dict[peptides.loc[rows[a], 'qseqid ']]
    pep_name = pep[0].split(' ')[1]
    peptides.loc[rows[a], 'peptide_name'] = pep_name
    samples = sample_names_peptide.loc[pep_name, 'Sample Name'].values
    if len(samples) > ma:
        prev_ma = int(ma)
        ma = len(samples)
        for c in range(prev_ma, ma):
            sample_names_list.append('sample_name'+str(c+2))
            peptides['sample_name'+str(c+2)] = ['NA' for x in range(peptides.shape[0])]
    for b in range(len(samples)):
        peptides.loc[rows[a], sample_names_list[b]] = samples[b]
    peptides.loc[rows[a], 'sequence'] = str(pep[1])
peptides.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_taxa_3.csv')
Sort the table into taxa assignments of peptides and an abundance table of samples vs peptides (here we check whether all classifications of each peptide have the same organism name, or at least the same genus - otherwise these peptides are removed):
taxa = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_taxa_4.csv', header=0, index_col=13)
taxa.drop(['qseqid ', 'Unnamed: 0.1'], axis=1, inplace=True)
taxa.drop_duplicates(keep='first', inplace=True)
unique_taxa_dict = {}
unique_taxa = list(set(taxa.index.values))
samples_all = []
all_samples_table = []
unique_taxa_added = []
domain_dict = {}
count = 0
import numpy as np #needed for the np.int64 type check below

for ut in unique_taxa:
    count += 1
    adding = False
    #if count > 200: continue
    if isinstance(taxa.loc[ut, 'organism_name'], str):
        genus_adding = taxa.loc[ut, 'organism_name']
        domain_dict[genus_adding] = taxa.loc[ut, 'domain']
        adding = True
    else:
        this_tax = taxa.loc[ut, 'organism_name'].values
        domain = taxa.loc[ut, 'domain'].values
        if len(list(set(this_tax))) == 1:
            genus_adding = list(set(this_tax))[0]
            domain_dict[genus_adding] = domain[0]
            adding = True
        else:
            this_genus = taxa.loc[ut, 'genus'].values
            if len(list(set(this_genus))) == 1:
                genus_adding = list(set(this_genus))[0]
                domain_dict[genus_adding] = domain[0]
                adding = True
    if adding:
        samples = list(taxa.loc[ut, :].values)
        this_peptide_samples = []
        if isinstance(samples[0], np.int64):
            for a in range(len(samples)):
                if isinstance(samples[a], str):
                    if 'sample' in samples[a]:
                        this_peptide_samples.append(samples[a])
        else:
            for b in range(len(samples)):
                if isinstance(samples[0], str):
                    if isinstance(samples[b], str):
                        if 'sample' in samples[b]:
                            this_peptide_samples.append(samples[b])
                else:
                    for c in range(len(samples[b])):
                        if isinstance(samples[b][c], str):
                            if 'sample' in samples[b][c]:
                                this_peptide_samples.append(samples[b][c])
        if this_peptide_samples == []:
            adding = False
    if adding:
        samples_all.append([ut, this_peptide_samples])
        unique_taxa_dict[ut] = genus_adding
        for a in range(len(this_peptide_samples)):
            if this_peptide_samples[a] not in all_samples_table:
                all_samples_table.append(this_peptide_samples[a])
        unique_taxa_added.append(ut)
print(len(unique_taxa_dict), len(samples_all))
feat_table = pd.DataFrame(columns=all_samples_table, index=unique_taxa_added).fillna(0)
for a in range(len(samples_all)):
    for b in range(len(samples_all[a][1])):
        feat_table.loc[samples_all[a][0], samples_all[a][1][b]] += 1
feat_table.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table.csv')
feat_table.insert(0, 'taxa', ['NA' for x in range(feat_table.shape[0])])
feat_table.insert(1, 'domain', ['NA' for x in range(feat_table.shape[0])])
for pep in feat_table.index.values:
    feat_table.loc[pep, 'taxa'] = unique_taxa_dict[pep]
    feat_table.loc[pep, 'domain'] = domain_dict[unique_taxa_dict[pep]]
feat_table.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_full_w_tax.csv')
ft_grouped = feat_table.rename(index=unique_taxa_dict)
ft_grouped = ft_grouped.drop(['domain', 'taxa'], axis=1)
ft_grouped = ft_grouped.groupby(by=ft_grouped.index, axis=0).sum()
ft_grouped.insert(0, 'domain', ['NA' for x in range(ft_grouped.shape[0])])
for tax in ft_grouped.index.values:
    ft_grouped.loc[tax, 'domain'] = domain_dict[tax]
ft_grouped.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv')
This reduces us from 42,776 peptides with classifications (of 1,286,515 original peptides) to 36,811, and gives an average of 214 peptides classified per sample.
Get Pfam database:
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
Run HMMER:
hmmsearch --tblout Pfam_hmm.tbl --domtblout Pfam_hmm.domtbl Pfam-A.hmm peptide_sequences.fasta > Pfam_hmm.out #--tblout and --domtblout each take an output file name
This gives hits (although probably none over the inclusion threshold - they all have very high E-values, likely because the peptides are very short), but I haven't looked at them yet. Waiting to see whether this is something that we are interested in first.
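If we do want to look at them, the --tblout table can be parsed with a few lines of Python. A sketch (the function name and column subset are mine), keeping just the target, query and full-sequence E-value:

```python
def parse_hmmer_tblout(path):
    #--tblout is whitespace-delimited with '#' comment lines; the first five
    #columns are: target name, target accession, query name, query accession,
    #full-sequence E-value
    hits = []
    with open(path) as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.split()
            hits.append({'target': fields[0], 'query': fields[2],
                         'evalue': float(fields[4])})
    return hits
```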
This leaves us with 36,811 unique peptides (an average of 214 per sample) classified to six domains (i.e. bacteria, fungi, archaea, viral, protozoa and vertebrate mammalian (human)), with 6,180 lowest-level classifications (genus or species) or 1,770 genera.
The overall proportions of taxa classified to domains, genera and species. Here we have removed genera with <0.2% relative abundance and species (this also includes genera, as it is just the lowest possible classification) with <0.15% relative abundance. Note that the choice of 0.2%/0.15% is arbitrary and based on giving good visualisation results. It may be a result in itself that no genus accounts for a large proportion of classifications: removing all genera with <1% relative abundance leaves only 6 genera - Aspergillus (2%), Bacillus (1.2%), Streptomyces (2.4%), Pseudomonas (2.4%), Homo (4.6%) and Paenibacillus (1.3%).
Colors are cycled; labels that appear first in the legend are plotted at the bottom of the bar (starting at 0), while those that appear last are at the top.
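get_cols is used by the plotting code below but defined elsewhere; a minimal stand-in (an assumption about its behaviour, not the original helper) that cycles n colors from a qualitative matplotlib colormap:

```python
import matplotlib.pyplot as plt

def get_cols(n):
    #tab20 has 20 distinct colors; cycle through them if more are needed
    colormap = plt.get_cmap('tab20')
    return [colormap(i % 20) for i in range(n)]
```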
feat_table_full = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_full_w_tax.csv', header=0, index_col=0)
species = list(feat_table_full.loc[:, 'taxa'].values)
domain = list(feat_table_full.loc[:, 'domain'].values)
genera = list(feat_table_full.loc[:, 'taxa'].values)
for g in range(len(genera)):
    try:
        genera[g] = genera[g].split(' ')[0]
    except:
        genera[g] = genera[g]
domain_unique, species_unique, genera_unique = list(set(domain)), list(set(species)), list(set(genera))
fig = plt.figure(figsize=(15,20))
ax1 = plt.subplot(291)
ax2 = plt.subplot(293)
ax3 = plt.subplot(297)
domain_plot = []
for a in domain_unique:
    domain_plot.append(domain.count(a))
total = sum(domain_plot)
domain_plot = [(a/total)*100 for a in domain_plot]
genera_plot = []
for a in genera_unique:
    genera_plot.append(genera.count(a))
total = sum(genera_plot)
genera_plot = [(a/total)*100 for a in genera_plot]
species_plot = []
for a in species_unique:
    species_plot.append(species.count(a))
total = sum(species_plot)
species_plot = [(a/total)*100 for a in species_plot]
for a in range(len(genera_plot)):
    genera_plot[a] = [genera_unique[a], genera_plot[a]]
for a in range(len(species_plot)):
    species_plot[a] = [species_unique[a], species_plot[a]]
genera = pd.DataFrame(genera_plot, columns=['Genus', 'Count'])
genera.set_index('Genus', inplace=True)
genera = genera[genera.max(axis=1) > 0.2]
genera.sort_values(by='Count', ascending=False, inplace=True)
species = pd.DataFrame(species_plot, columns=['Species', 'Count'])
species.set_index('Species', inplace=True)
species = species[species.max(axis=1) > 0.15]
species.sort_values(by='Count', ascending=False, inplace=True)
print(species)
domain_colors = get_cols(len(domain_plot))
bottom=0
for a in range(len(domain_plot)):
    ax1.bar(1, domain_plot[a], bottom=bottom, color=domain_colors[a], label=domain_unique[a].split('_')[0], edgecolor='k')
    bottom += domain_plot[a]
ax1.legend(loc='upper left', bbox_to_anchor=(1, 1.02))
ax1.set_ylim([0, 100])
ax1.set_ylabel('Relative abundance (%)')
ax1.set_xticklabels([])
genera_colors = get_cols(genera.shape[0])
bottom=0
for a in range(len(list(genera.index.values))):
    name = list(genera.index.values)[a]
    ax2.bar(1, genera.loc[name, 'Count'], bottom=bottom, color=genera_colors[a], label=name, edgecolor='k')
    bottom += genera.loc[name, 'Count']
ax2.legend(loc='upper left', bbox_to_anchor=(1, 1.02), ncol=2)
ax2.set_ylim([0, 43])
ax2.set_xticklabels([])
species_colors = get_cols(species.shape[0])
bottom=0
for a in range(len(list(species.index.values))):
    name = list(species.index.values)[a]
    ax3.bar(1, species.loc[name, 'Count'], bottom=bottom, color=species_colors[a], label=name, edgecolor='k')
    bottom += species.loc[name, 'Count']
ax3.legend(loc='upper left', bbox_to_anchor=(1, 1.02))
ax3.set_ylim([0, 14.5])
ax3.set_xticklabels([])
ax1.set_title('Domain')
ax2.set_title('Genus')
ax3.set_title('Species')
plt.subplots_adjust(wspace=0.5)
plt.show()
All taxa classified to species where possible (genus if not). This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.transpose()
pos, npos, stress = transform_for_NMDS(ft)
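transform_for_NMDS is a helper defined outside this document; a sketch of roughly what it does (an assumption based on how its outputs are used here): Bray-Curtis distances between samples, a metric MDS fit used to initialise a non-metric MDS, returning both sets of positions and the stress.

```python
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def transform_for_NMDS(df, seed=0):
    #samples are rows; Bray-Curtis distance between every pair of samples
    dist = squareform(pdist(df.values, metric='braycurtis'))
    mds = MDS(n_components=2, metric=True, dissimilarity='precomputed', random_state=seed)
    pos = mds.fit_transform(dist)
    nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
               random_state=seed, n_init=1, max_iter=300)
    npos = nmds.fit_transform(dist, init=pos) #seed the non-metric fit with the metric solution
    return pos, npos, nmds.stress_
```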
Colored by disease:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in disease_colors_samples:
        sample_colors.append(disease_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
Colored by treatment:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in treat_colors_samples: #was checking disease_colors_samples, but the colors here come from treat_colors_samples
        sample_colors.append(treat_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
All taxa classified to genus. This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft = ft.transpose()
pos, npos, stress = transform_for_NMDS(ft)
Colored by disease:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in disease_colors_samples:
        sample_colors.append(disease_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
Colored by treatment:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in treat_colors_samples:
        sample_colors.append(treat_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
All taxa classified to species where possible (genus if not). Any that are not classified as bacteria are removed. This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft = ft.loc[ft['domain'] == 'bacteria'].fillna(value=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.transpose()
pos, npos, stress = transform_for_NMDS(ft)
Colored by disease:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in disease_colors_samples:
        sample_colors.append(disease_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
Colored by treatment:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in treat_colors_samples:
        sample_colors.append(treat_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
All taxa classified to genus. Any that are not classified as bacteria are removed. This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft = ft.loc[ft['domain'] == 'bacteria'].fillna(value=0)
ft.drop(['domain'], axis=1, inplace=True)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft = ft.transpose()
pos, npos, stress = transform_for_NMDS(ft)
Colored by disease:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in disease_colors_samples:
        sample_colors.append(disease_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
Colored by treatment:
sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
    if a in treat_colors_samples:
        sample_colors.append(treat_colors_samples[a])
    else:
        sample_colors.append('gray')
    if a in type_markers_samples:
        sample_markers.append(type_markers_samples[a])
    else:
        sample_markers.append('s')
plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
    ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()
Here I have transformed samples to relative abundance, grouped them by treatment type (taking the mean of each sample grouping) and plotted the abundance of all genera above 0.5% relative abundance. I have removed samples that didn’t appear in the ‘sample_centered_table.tsv’ file. Note that I haven’t corrected genus names that don’t follow the standard binomial ‘Genus species’ format. If a genus is present in a sample grouping, its relative abundance is shown in the cell.
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
plt.figure(figsize=(10,40))
ax1 = plt.subplot(111)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    #ma = max(row)
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        if whole[b] > 2.5: color='k'
        else: color='w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [x+1 for x in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
plt.tight_layout()
plt.show()
This shows the same as the previous heatmap grouped by treatment, but here the treatment groupings as well as the genera have been hierarchically clustered (using Bray-Curtis distance).
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
plt.figure(figsize=(20,50))
ax1 = plt.subplot2grid((50,4), (10,1), rowspan=40, colspan=2)
ax2 = plt.subplot2grid((50,4), (0,1), rowspan=6, colspan=2, frameon=False)
ax3 = plt.subplot2grid((50,4), (10,0), rowspan=40, frameon=False)
plt.sca(ax2)
Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')
y_labels, locs, xlocs, labels = list(ax2.get_xticklabels()), list(ax2.get_xticks()), list(ax2.get_yticks()), []
for y in y_labels:
    labels.append(y.get_text())
grouping = list(ft.columns)
plot_labels = []
for a in range(len(labels)):
    plot_labels.append(grouping[int(labels[a])-1])
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]
plt.sca(ax3)
Z = hierarchy.linkage(ft, 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='left')
y_labels, locs, xlocs, labels = list(ax3.get_yticklabels()), list(ax3.get_yticks()), list(ax3.get_xticks()), []
for y in y_labels:
labels.append(y.get_text())
grouping = list(ft.index.values)
plot_labels = []
for a in range(len(labels)):
plot_labels.append(grouping[int(labels[a])-1])
plt.xticks([])
plt.yticks([])
plt.sca(ax1)
ft = ft.reindex(plot_labels)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        # black text on bright (high abundance) cells, white on dark cells
        if whole[b] > 2.5: color = 'k'
        else: color = 'w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [v+1 for v in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()
plt.subplots_adjust(wspace=0.05)
plt.tight_layout()
plt.show()
This shows the same as the first heatmap grouped by treatment, but here the treatment groupings have been hierarchically clustered (using Bray-Curtis distance) and the genera sorted by prevalence.
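The prevalence sort used below builds a presence/absence copy of the table cell by cell; the same quantity can be computed more compactly with a boolean mean. A minimal sketch with made-up values (the names `g1`, `g2`, `t1`–`t4` are hypothetical):

```python
import pandas as pd

# toy relative-abundance table: genera (rows) x treatment groups (columns)
ft = pd.DataFrame({'t1': [0.0, 0.5], 't2': [1.2, 0.0],
                   't3': [3.4, 0.0], 't4': [0.0, 0.0]},
                  index=['g1', 'g2'])

# prevalence = fraction of groups in which a genus is present at all;
# equivalent to an explicit presence/absence loop, but vectorised
prevalence = (ft > 0).mean(axis=1)
ft_sorted = ft.loc[prevalence.sort_values().index]
```

`g1` is present in two of the four groups (prevalence 0.5) and `g2` in one (0.25), so the ascending sort places `g2` first.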
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
plt.figure(figsize=(20,50))
ax1 = plt.subplot2grid((50,4), (10,1), rowspan=40, colspan=2)
ax2 = plt.subplot2grid((50,4), (0,1), rowspan=6, colspan=2, frameon=False)
#ax3 = plt.subplot2grid((50,4), (10,0), rowspan=40, frameon=False)
plt.sca(ax2)
Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')
# dn['ivl'] holds the dendrogram leaf labels in plotted order; with no labels
# supplied these are 0-based indices into the original column order
grouping = list(ft.columns)
plot_labels = [grouping[int(i)] for i in dn['ivl']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]
plt.sca(ax1)
pres_abs = ft.copy()
for i in pres_abs.columns:
    for j in pres_abs.index.values:
        if pres_abs.loc[j, i] > 0: pres_abs.loc[j, i] = 1
ft['prevalence'] = pres_abs.mean(axis=1)
ft = ft.sort_values(by='prevalence')
ft = ft.drop(['prevalence'], axis=1)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        # black text on bright (high abundance) cells, white on dark cells
        if whole[b] > 2.5: color = 'k'
        else: color = 'w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [v+1 for v in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()
plt.subplots_adjust(wspace=0.05)
plt.show()
This shows the same as the first heatmap grouped by treatment, but here the treatment groupings have been hierarchically clustered (using Bray-Curtis distance) and the genera sorted by abundance.
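The abundance sort used below (a temporary `total` column holding the row mean, sort, then drop) can be sketched on a toy table with made-up values:

```python
import pandas as pd

# toy table: genera (rows) x treatment groups (columns), hypothetical percentages
ft = pd.DataFrame({'t1': [2.0, 0.5, 4.0], 't2': [1.0, 0.7, 3.0]},
                  index=['g1', 'g2', 'g3'])

# sort genera by their mean abundance across groups (ascending, so the most
# abundant genus is plotted last, i.e. at the top of the bottom-up heatmap)
ft['total'] = ft.mean(axis=1)
ft = ft.sort_values(by='total').drop(columns='total')
```

The row means are 1.5, 0.6 and 3.5, so the rows come out in the order `g2`, `g1`, `g3`.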
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
plt.figure(figsize=(20,50))
ax1 = plt.subplot2grid((50,4), (10,1), rowspan=40, colspan=2)
ax2 = plt.subplot2grid((50,4), (0,1), rowspan=6, colspan=2, frameon=False)
#ax3 = plt.subplot2grid((50,4), (10,0), rowspan=40, frameon=False)
plt.sca(ax2)
Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')
# dn['ivl'] holds the dendrogram leaf labels in plotted order; with no labels
# supplied these are 0-based indices into the original column order
grouping = list(ft.columns)
plot_labels = [grouping[int(i)] for i in dn['ivl']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]
plt.sca(ax1)
ft['total'] = ft.mean(axis=1)
ft = ft.sort_values(by='total')
ft = ft.drop(['total'], axis=1)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        # black text on bright (high abundance) cells, white on dark cells
        if whole[b] > 2.5: color = 'k'
        else: color = 'w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [v+1 for v in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()
plt.subplots_adjust(wspace=0.05)
plt.show()
Here I have transformed the samples to relative abundance, grouped them by disease type (taking the mean of each sample grouping) and plotted the abundance of all genera above 0.5% relative abundance. I have removed samples that didn’t appear in the ‘sample_centered_table.tsv’ file. Note that I haven’t done anything to tidy up the names of genera that aren’t a standard binomial Genus species name. If a genus is present in a sample grouping, its relative abundance is shown in that cell.
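The transform-then-group pipeline used below (per-sample percentages, rename samples to their disease group, average within each group) can be checked on a toy count table; the sample and group names here are hypothetical:

```python
import pandas as pd

# toy count table: two genera x four samples (made-up names and values)
counts = pd.DataFrame({'s1': [30, 70], 's2': [10, 90],
                       's3': [50, 50], 's4': [20, 80]},
                      index=['Bacteroides', 'Prevotella'])

# per-sample percent relative abundance (each column now sums to 100)
rel = counts.divide(counts.sum(axis=0), axis=1).multiply(100)

# rename samples to their disease group, then average within each group
sample_disease = {'s1': 'CD', 's2': 'CD', 's3': 'UC', 's4': 'UC'}
rel = rel.rename(columns=sample_disease)
grouped = rel.transpose().groupby(level=0).mean().transpose()
```

Because the group means are averages of columns that each sum to 100, every group column of `grouped` also sums to 100.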
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
print(ft)
plt.figure(figsize=(15,60))
ax1 = plt.subplot(111)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        # black text on bright (high abundance) cells, white on dark cells
        if whole[b] > 2.5: color = 'k'
        else: color = 'w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [v+1 for v in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
plt.tight_layout()
plt.show()
This shows the same as the previous heatmap grouped by disease, but here the disease groupings as well as the genera have been hierarchically clustered (using Bray-Curtis distance).
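The cell colouring in these heatmaps maps 0–5% relative abundance onto the ‘hot’ colormap; anything above 5% saturates at the bright end. A minimal sketch of that mapping (note that 2.5% normalises to the colormap midpoint, which is why the text colour switches between black and white at that threshold):

```python
import matplotlib as mpl

# abundances from 0-5 % are mapped onto the 'hot' colormap; values above
# 5 % saturate at the bright end
norm = mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap='hot')

rgba_low = m.to_rgba(0.0)   # near-black cell, so white text is used for contrast
rgba_high = m.to_rgba(5.0)  # near-white cell, so black text is used for contrast
```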
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
plt.figure(figsize=(25,70))
ax1 = plt.subplot2grid((70,5), (10,1), rowspan=60, colspan=3)
ax2 = plt.subplot2grid((70,5), (0,1), rowspan=5, colspan=3, frameon=False)
ax3 = plt.subplot2grid((70,5), (10,0), rowspan=60, frameon=False)
plt.sca(ax2)
Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')
# dn['ivl'] holds the dendrogram leaf labels in plotted order; with no labels
# supplied these are 0-based indices into the original column order
grouping = list(ft.columns)
plot_labels = [grouping[int(i)] for i in dn['ivl']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]
plt.sca(ax3)
Z = hierarchy.linkage(ft, 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='left')
# same again for the row dendrogram, this time indexing into the genus names
grouping = list(ft.index.values)
plot_labels = [grouping[int(i)] for i in dn['ivl']]
plt.xticks([])
plt.yticks([])
plt.sca(ax1)
ft = ft.reindex(plot_labels)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        # black text on bright (high abundance) cells, white on dark cells
        if whole[b] > 2.5: color = 'k'
        else: color = 'w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [v+1 for v in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()
plt.subplots_adjust(wspace=0.05)
plt.show()
This shows the same as the first heatmap grouped by disease, but here the disease groupings have been hierarchically clustered (using Bray-Curtis distance) and the genera sorted by prevalence.
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
plt.figure(figsize=(25,70))
ax1 = plt.subplot2grid((70,5), (10,1), rowspan=60, colspan=3)
ax2 = plt.subplot2grid((70,5), (0,1), rowspan=5, colspan=3, frameon=False)
#ax3 = plt.subplot2grid((70,5), (10,0), rowspan=60, frameon=False)
plt.sca(ax2)
Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')
# dn['ivl'] holds the dendrogram leaf labels in plotted order; with no labels
# supplied these are 0-based indices into the original column order
grouping = list(ft.columns)
plot_labels = [grouping[int(i)] for i in dn['ivl']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]
plt.sca(ax1)
pres_abs = ft.copy()
for i in pres_abs.columns:
    for j in pres_abs.index.values:
        if pres_abs.loc[j, i] > 0: pres_abs.loc[j, i] = 1
ft['prevalence'] = pres_abs.mean(axis=1)
ft = ft.sort_values(by='prevalence')
ft = ft.drop(['prevalence'], axis=1)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        # black text on bright (high abundance) cells, white on dark cells
        if whole[b] > 2.5: color = 'k'
        else: color = 'w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [v+1 for v in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()
plt.subplots_adjust(wspace=0.05)
plt.show()
This shows the same as the first heatmap grouped by disease, but here the disease groupings have been hierarchically clustered (using Bray-Curtis distance) and the genera sorted by abundance.
ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)
ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}
for a in ft.index.values:
    genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()
keeping = []
for col in ft.columns:
    if 'sample' in col:
        keeping.append(False)
    else:
        keeping.append(True)
ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]
plt.figure(figsize=(25,70))
ax1 = plt.subplot2grid((70,5), (10,1), rowspan=60, colspan=3)
ax2 = plt.subplot2grid((70,5), (0,1), rowspan=5, colspan=3, frameon=False)
#ax3 = plt.subplot2grid((70,5), (10,0), rowspan=60, frameon=False)
plt.sca(ax2)
Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')
# dn['ivl'] holds the dendrogram leaf labels in plotted order; with no labels
# supplied these are 0-based indices into the original column order
grouping = list(ft.columns)
plot_labels = [grouping[int(i)] for i in dn['ivl']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]
plt.sca(ax1)
ft['total'] = ft.mean(axis=1)
ft = ft.sort_values(by='total')
ft = ft.drop(['total'], axis=1)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)
y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
    row = ft.loc[a, :].values
    whole = list(row)
    strings = [str(round(b, 1)) for b in row]
    row = [m.to_rgba(b) for b in row]
    ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
    for b in range(len(strings)):
        if whole[b] == 0: continue
        # black text on bright (high abundance) cells, white on dark cells
        if whole[b] > 2.5: color = 'k'
        else: color = 'w'
        ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
    bottom = [v+1 for v in bottom]
    all_y.append(bottom[0]-0.5)
ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()
plt.subplots_adjust(wspace=0.05)
plt.show()